Abstract

We present a simple but accurate parser which exploits both large tree fragments and symbol
refinement. We parse with all fragments of the training set, in contrast to much recent
work on tree selection in data-oriented parsing and tree-substitution grammar learning. We
require only simple, deterministic grammar symbol refinement, in contrast to recent work on
latent symbol refinement. Moreover, our parser requires no explicit lexicon machinery,
instead parsing input sentences as character streams. Despite its simplicity, our parser achieves
accuracies of over 88% F1 on the standard English WSJ task, which is competitive with
substantially more complicated state-of-the-art lexicalized and latent-variable parsers. Additional
specific contributions center on making implicit all-fragments parsing efficient, including a
coarse-to-fine inference scheme and a new graph encoding.^1


1 Introduction


Modern NLP systems have increasingly used data-intensive models that capture many or even all
substructures from the training data. In the domain of syntactic parsing, the idea that all training
fragments^2 might be relevant to parsing has a long history, including tree-substitution grammar
(data-oriented parsing) approaches [Scha, 1990, Bod, 1993, Goodman, 1996a, Chiang, 2003] and
tree kernel approaches [Collins and Duffy, 2002]. For machine translation, the key modern
advancement has been the ability to represent and memorize large training substructures, be it in
contiguous phrases [Koehn et al., 2003] or syntactic trees [Galley et al., 2004, Chiang, 2005,
DeNeefe and Knight, 2009]. In all such systems, a central challenge is efficiency: there are generally
a combinatorial number of substructures in the training data, and it is impractical to explicitly
extract them all. On both efficiency and statistical grounds, much recent TSG work has focused on
fragment selection [Zuidema, 2007, Cohn et al., 2009, Post and Gildea, 2009].

At the same time, many high-performance parsers have focused on symbol refinement approaches,
wherein PCFG independence assumptions are weakened not by increasing rule sizes but by
subdividing coarse treebank symbols into many subcategories either using structural annotation
[Johnson, 1998, Klein and Manning, 2003] or lexicalization [Collins, 1999, Charniak, 2000]. Indeed,
a recent trend has shown high accuracies from models which are dedicated to inducing such
subcategories [Henderson, 2004, Matsuzaki et al., 2005, Petrov et al., 2006]. In this work, we present
a simplified parser which combines the two basic ideas, using both large fragments and symbol
refinement, to provide non-local and local context respectively. The two approaches turn out to
be highly complementary; even the simplest (deterministic) symbol refinement and a basic use
of an all-fragments grammar combine to give accuracies substantially above recent work on
tree-substitution grammar based parsers and approaching top refinement-based parsers. For example,
our best result on the English WSJ task is an F1 of over 88%, where recent TSG parsers^3 achieve
82-84% and top refinement-based parsers^4 achieve 88-90% (e.g., Table 5).

^1 This work was published in the 48th Annual Meeting of the Association for Computational Linguistics (ACL).
See Bansal and Klein [2010].

^2 In this work, a fragment means an elementary tree in a tree-substitution grammar, while a subtree means a
fragment that bottoms out in terminals.

Rather than select fragments, we use a simplification of the PCFG-reduction of DOP [Goodman,
1996a] to work with all fragments. This reduction is a flexible, implicit representation of the
fragments that, rather than extracting an intractably large grammar over fragment types, indexes
all nodes in the training treebank and uses a compact grammar over indexed node tokens. This
indexed grammar, when appropriately marginalized, is equivalent to one in which all fragments
are explicitly extracted. Our work is the first to apply this reduction to full-scale parsing. In
this direction, we present a coarse-to-fine inference scheme and a compact graph encoding of the
training set, which, together, make parsing manageable (in terms of speed and memory). This
tractability allows us to avoid selection of fragments, and work with all fragments.

Of course, having a grammar that includes all training substructures is only desirable to the extent
that those structures can be appropriately weighted. Implicit representations like those used here
do not allow arbitrary weightings of fragments. However, we use a simple weighting scheme
which does decompose appropriately over the implicit encoding, and which is flexible enough to
allow weights to depend not only on fragment frequency but also on fragment size, node patterns,
and certain lexical properties. Similar ideas have been explored in Bod [2001], Collins and Duffy
[2002], and Goodman [2003]. Our model empirically affirms the effectiveness of such a flexible
weighting scheme in full-scale experiments.

We also investigate parsing without an explicit lexicon. The all-fragments approach has the
advantage that parsing down to the character level requires no special treatment; we show that an explicit
lexicon is not needed when sentences are considered as strings of characters rather than words.
This avoids the need for complex unknown word models and other specialized lexical resources.

The main contribution of this work is to show practical, tractable methods for working with an
all-fragments model, without an explicit lexicon. In the parsing case, the central result is that
accuracies in the range of state-of-the-art parsers (i.e., over 88% F1 on English WSJ) can be obtained
with no sampling, no latent-variable modeling, no smoothing, and even no explicit lexicon (hence
negligible training overall). These techniques, however, are not limited to the case of monolingual
parsing, offering extensions to models of machine translation, semantic interpretation, and other
areas in which a similar tension exists between the desire to extract many large structures and the
computational cost of doing so.

^3 Including Zuidema [2007], Cohn et al. [2009], Post and Gildea [2009]. Zuidema [2007] incorporates deterministic
refinements inspired by Klein and Manning [2003].

^4 Including Collins [1999], Charniak and Johnson [2005], Petrov and Klein [2007].
Figure 1: Grammar definition and sample derivations and fragments in the grammar for (a) the explicitly extracted
all-fragments grammar $G$, and (b) its implicit representation $G_I$.


2 Representation of Implicit Grammars

2.1 All-Fragments Grammars

We consider an all-fragments grammar $G$ (see Figure 1(a)) derived from a binarized treebank $B$.
$G$ is formally a tree-substitution grammar [Resnik, 1992, Bod, 1993] wherein each subgraph of
each training tree in $B$ is an elementary tree, or fragment $f$, in $G$. In $G$, each derivation $d$ is a tree
(multiset) of fragments (Figure 1(c)), and the weight of the derivation is the product of the weights
of the fragments: $\omega(d) = \prod_{f \in d} \omega(f)$. In the following, the derivation weights, when normalized
over a given sentence $s$, are interpretable as conditional probabilities, so $G$ induces distributions of
the form $P(d \mid s)$.

In models like $G$, many derivations will generally correspond to the same unsegmented tree,
and the parsing task is to find the tree whose sum of derivation weights is highest:
$t_{\max} = \arg\max_t \sum_{d \in t} \omega(d)$. This optimization is intractable in a way that is orthogonal to this work
[Sima’an, 1996]; we describe minimum Bayes risk approximations in Section 4.
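As a minimal sketch of these two definitions (not the parser itself), the snippet below scores derivations as products of fragment weights and scores each tree as the sum over its derivations. The fragment names, weights, and the tiny derivation list are made-up placeholders; a real all-fragments grammar has far too many derivations to enumerate this way, which is why the exact maximization is intractable.

```python
from math import prod
from collections import defaultdict

# Hypothetical fragment weights omega(f); the names f1..f4 are placeholders.
fragment_weight = {"f1": 0.5, "f2": 0.2, "f3": 0.1, "f4": 0.4}

# Each derivation is a multiset (here a list) of fragments, paired with the
# unsegmented tree it yields once fragment boundaries are erased.
derivations = [
    (["f1", "f2"], "tree_A"),
    (["f3"], "tree_A"),
    (["f4", "f2"], "tree_B"),
]

def derivation_weight(fragments):
    """omega(d): product of omega(f) over the fragments f in d."""
    return prod(fragment_weight[f] for f in fragments)

def tree_scores(derivs):
    """Sum derivation weights over all derivations of the same tree."""
    scores = defaultdict(float)
    for fragments, tree in derivs:
        scores[tree] += derivation_weight(fragments)
    return dict(scores)

scores = tree_scores(derivations)
best_tree = max(scores, key=scores.get)   # t_max over the enumerated trees
```

Dividing each derivation weight by the total over all derivations of the sentence would give the conditional probabilities $P(d \mid s)$ mentioned above.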


2.2 Implicit Representation of $G$

Explicitly extracting all fragment-rules of a grammar $G$ is memory and space intensive, and
impractical for full-size treebanks. As a tractable alternative, we consider an implicit grammar $G_I$
(see Figure 1(b)) that has the same posterior probabilities as $G$. To construct $G_I$, we use a
simplification of the PCFG-reduction of DOP by Goodman [1996a].^5 $G_I$ has base symbols, which are
the symbol types from the original treebank, as well as indexed symbols, which are obtained by
assigning a unique index to each node token in the training treebank. The vast majority of symbols
in $G_I$ are therefore indexed symbols. While it may seem that such grammars will be overly large,
they are in fact reasonably compact, being linear in the size of the treebank $B$, while $G$ is exponential
in the length of a sentence. In particular, we found that $G_I$ was smaller than explicit extraction of
all depth 1 and 2 unbinarized fragments for our treebanks; in practice, even just the raw treebank
grammar grows almost linearly in the size of $B$.^6

There are three kinds of rules in $G_I$, which are illustrated in Figure 1(b). The BEGIN rules transition
from a base symbol to an indexed symbol and represent the beginning of a fragment from $G$. The
CONTINUE rules use only indexed symbols and correspond to specific depth-1 binary fragment
tokens from training trees, representing the internal continuation of a fragment in $G$. Finally, END
rules transition from an indexed symbol to a base symbol, representing the frontier of a fragment.
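The construction of the three rule types can be sketched as follows. This is an illustrative toy, not the original implementation: the nested-tuple tree encoding, the function name `build_implicit_rules`, and the separate list for word-emitting rules are all assumptions made for the example.

```python
from itertools import count

def build_implicit_rules(treebank):
    """Index every node token and emit BEGIN / CONTINUE / END rules.

    Trees are nested tuples: (label, left, right) for binary nodes and
    (label, word) for preterminals (an assumed toy encoding).
    """
    fresh = count(1)
    begin, cont, end, lex = [], [], [], []

    def visit(node):
        label = node[0]
        sym = (label, next(fresh))            # indexed symbol X_i
        begin.append((label, sym))            # BEGIN:    X   -> X_i
        end.append((sym, label))              # END:      X_i -> X
        if len(node) == 3:
            left = visit(node[1])
            right = visit(node[2])
            cont.append((sym, left, right))   # CONTINUE: X_i -> Y_j Z_k
        else:
            lex.append((sym, node[1]))        # lexical:  X_i -> word
        return sym

    for tree in treebank:
        visit(tree)
    return begin, cont, end, lex

toy_treebank = [
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
]
begin, cont, end, lex = build_implicit_rules(toy_treebank)
```

Each node token contributes a constant number of rules, which is the sense in which the indexed grammar stays linear in the treebank size.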

By construction, all derivations in $G_I$ will segment, as shown in Figure 1(b), into regions
corresponding to tokens of fragments from the training treebank $B$. Let $\pi$ be the map which takes
appropriate fragments in $G_I$ (those that begin and end with base symbols and otherwise contain
only indexed symbols), and maps them to the corresponding $f$ in $G$. We can consider any
derivation $d_I$ in $G_I$ to be a tree of fragments $f_I$, each fragment a token of a fragment type $f = \pi(f_I)$
in the original grammar $G$. By extension, we can therefore map any derivation $d_I$ in $G_I$ to the
corresponding derivation $d = \pi(d_I)$ in $G$.

The mapping $\pi$ is an onto mapping from $G_I$ to $G$. In particular, each derivation $d$ in $G$ has a
non-empty set of corresponding derivations $\{d_I\} = \pi^{-1}(d)$ in $G_I$, because fragments $f$ in $d$ correspond
to multiple fragments in $G_I$ that differ only in their indexed symbols (one per occurrence of
$f$ in $B$). Therefore, the set of derivations in $G$ is preserved in $G_I$. We now discuss how weights
can be preserved under $\pi$.


2.3 Equivalence for Weighted Grammars

In general, arbitrary weight functions $\omega$ on fragments in $G$ do not decompose along the increased
locality of $G_I$. However, we now consider a usefully broad class of weighting schemes for which
the posterior probabilities under $G$ of derivations $d$ are preserved in $G_I$. In particular, assume
that we have a weighting $\omega$ on rules in $G_I$ which does not depend on the specific indices used.
Therefore, any fragment $f_I$ will have a weight in $G_I$ of the form:

$$\omega_I(f_I) = \omega_{\mathrm{BEGIN}}(b) \prod_{r \in C} \omega_{\mathrm{CONT}}(r) \prod_{e \in E} \omega_{\mathrm{END}}(e)$$

where $b$ is the BEGIN rule, $r$ are CONTINUE rules, and $e$ are END rules in the fragment $f_I$ (see
Figure 1(b)). Because $\omega$ is assumed to not depend on the specific indices, all $f_I$ which correspond
to the same $f$ under $\pi$ will have the same weight $\omega_I(f)$ in $G_I$.

^5 The difference is that Goodman [1996a] collapses our BEGIN and END rules into the binary productions, giving a
larger grammar which is less convenient for weighting.

^6 Just half the training set (19916 trees) itself had 1.7 million depth 1 and 2 unbinarized rules compared to the 0.9
million indexed symbols in $G_I$ (after graph packing). Even extracting binarized fragments (depth 1 and 2, with one
order of parent annotation) gives us 0.75 million rules, and, practically, we would need fragments of greater depth.

In this case, we can define an induced weight $\omega_G(f)$ for fragments $f$ in $G$ by

$$\omega_G(f) = \sum_{f_I \in \pi^{-1}(f)} \omega_I(f_I) = n(f)\,\omega_I(f)
            = n(f)\,\omega_{\mathrm{BEGIN}}(b') \prod_{r' \in C} \omega_{\mathrm{CONT}}(r') \prod_{e' \in E} \omega_{\mathrm{END}}(e')$$

where now $b'$, $r'$ and $e'$ are non-indexed type abstractions of $f$’s member productions in $G_I$ and
$n(f) = |\pi^{-1}(f)|$ is the number of tokens of $f$ in $B$.

Under the weight function $\omega_G(f)$, any derivation $d$ in $G$ will have weight which obeys

$$\omega_G(d) = \prod_{f \in d} \omega_G(f) = \prod_{f \in d} n(f)\,\omega_I(f) = \sum_{d_I \in \pi^{-1}(d)} \omega_I(d_I)$$

and so the posterior $P(d \mid s)$ of a derivation $d$ for a sentence $s$ will be the same whether computed
in $G$ or $G_I$. Therefore, provided our weighting function on fragments $f$ in $G$ decomposes over
the derivational representation of $f$ in $G_I$, we can equivalently compute the quantities we need for
inference (see Section 4) using $G_I$ instead.
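The last identity can be checked numerically on a toy case. The sketch below assumes made-up token counts $n(f)$ and index-independent fragment weights $\omega_I(f)$, and it assumes that the indexed derivations in $\pi^{-1}(d)$ are obtained by independently choosing one training token for each fragment of $d$.

```python
from itertools import product
from math import prod, isclose

# Hypothetical token counts n(f) and index-independent weights omega_I(f).
n = {"f1": 3, "f2": 2}
omega_I = {"f1": 0.1, "f2": 0.25}

derivation = ["f1", "f2", "f1"]   # a derivation d in G (multiset of fragments)

# Weight in G under the induced weighting omega_G(f) = n(f) * omega_I(f).
weight_in_G = prod(n[f] * omega_I[f] for f in derivation)

# Weight in G_I: enumerate every indexed derivation in pi^{-1}(d) by picking
# one training token per fragment; each choice carries the same weight.
token_choices = [range(n[f]) for f in derivation]
weight_in_GI = sum(
    prod(omega_I[f] for f in derivation)
    for _ in product(*token_choices)
)
```

Both quantities come out to the same value (here 18 indexed derivations, each of weight 0.0025), which is the content of the equation chain above.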


3 Parameterization of Implicit Grammars

3.1 Classical DOP1

The original data-oriented parsing model ‘DOP1’ [Bod, 1993] is a particular instance of the
general weighting scheme which decomposes appropriately over the implicit encoding, described in
Section 2.3. Figure 2 shows rule weights for DOP1 in the parameter schema we have defined.
The END rule weight is 0 or 1 depending on whether $A$ is an intermediate symbol or not.^7 The
local fragments in DOP1 were flat (non-binary) so this weight choice simulates that property by
not allowing switching between fragments at intermediate symbols.

The original DOP1 model weights a fragment $f$ in $G$ as $\omega_G(f) = n(f)/s(X)$, i.e., the frequency
of fragment $f$ divided by the number of fragments rooted at base symbol $X$. This is simulated
by our weight choices (Figure 2) where each fragment $f_I$ in $G_I$ has weight $\omega_I(f_I) = 1/s(X)$
and therefore, $\omega_G(f) = \sum_{f_I \in \pi^{-1}(f)} \omega_I(f_I) = n(f)/s(X)$. Given the weights used for DOP1, the
recursive formula for the number of fragments $s(X_i)$ rooted at indexed symbol $X_i$ (and for the
CONTINUE rule $X_i \to Y_j\,Z_k$) is

$$s(X_i) = (1 + s(Y_j))(1 + s(Z_k)), \qquad (1)$$

where $s(Y_j)$ and $s(Z_k)$ are the number of fragments rooted at indexed symbols $Y_j$ and $Z_k$
(non-intermediate) respectively. The number of fragments $s(X)$ rooted at base symbol $X$ is then
$s(X) = \sum_{X_i} s(X_i)$.

^7 Intermediate symbols are those created during binarization.

Figure 2: [...] (Goodman [2003]) and our model. Here $s(X)$ denotes the total number of fragments rooted at base symbol $X$.

Implicitly parsing with the full DOP1 model (no sampling of fragments) using the weights in
Figure 2 gives a 68% parsing accuracy on the WSJ dev-set.^8 This result indicates that the weight
of a fragment should depend on more than just its frequency.
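The counting recursion of Equation 1 can be checked on a toy tree. This is only a sketch under stated assumptions: trees use a nested-tuple encoding chosen for the example, every child is treated as non-intermediate, and a preterminal is assumed to root exactly one fragment (itself with its word).

```python
from collections import defaultdict

def count_fragments(node, totals):
    """Return s(X_i) for this node token and add it to totals[X].

    Trees are (label, left, right) for binary nodes and (label, word)
    for preterminals (an assumed toy encoding).
    """
    label = node[0]
    if len(node) == 2:
        s = 1                                   # assumed base case
    else:
        s_left = count_fragments(node[1], totals)
        s_right = count_fragments(node[2], totals)
        s = (1 + s_left) * (1 + s_right)        # Equation 1
    totals[label] += s                          # s(X): sum of s(X_i) over X_i
    return s

totals = defaultdict(int)
tree = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))
s_root = count_fragments(tree, totals)
```

In each factor, the 1 counts cutting the fragment at the child and the $s(\cdot)$ term counts continuing into any fragment rooted there; for the toy tree this gives 4 fragments rooted at the VP token and 10 at the S token.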


3.2 Better Parameterization

As has been pointed out in the literature, large-fragment grammars can benefit from weights of
fragments depending not only on their frequency but also on other properties. For example, Bod
[2001] restricts the size and number of words in the frontier of the fragments, and Collins and
Duffy [2002] and Goodman [2003] both give larger fragments smaller weights. Our model can
incorporate both size and lexical properties. In particular, we set $\omega_{\mathrm{CONT}}(r)$ for each binary
CONTINUE rule $r$ to a learned constant $\omega_{\mathrm{BODY}}$, and we set the weight for each rule with a POS parent to a
constant $\omega_{\mathrm{LEX}}$ (see Figure 2). Fractional values of these parameters allow the weight of a fragment
to depend on its size and lexical properties.

^8 For DOP1 experiments, we use no symbol refinement. We annotate with full left binarization history to imitate the
flat nature of fragments in DOP1. We use mild coarse-pass pruning (Section 4.1) without which the basic all-fragments
chart does not fit in memory. Standard WSJ treebank splits used: sec 2-21 training, 22 dev, 23 test.
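Rule score: $r(A \to B\,C, i, k, j) = \sum_x \sum_y \sum_z O(A_x, i, j)\,\omega(A_x \to B_y\,C_z)\,I(B_y, i, k)\,I(C_z, k, j)$

Max-Constituent: $q(A, i, j) = \frac{\sum_x O(A_x, i, j)\,I(A_x, i, j)}{\sum_x I(\mathrm{root}_x, 0, n)}$, with $t_{\max} = \arg\max_t \sum_{c \in t} q(c)$

Max-Rule-Sum: $q(A \to B\,C, i, k, j) = \frac{r(A \to B\,C, i, k, j)}{\sum_x I(\mathrm{root}_x, 0, n)}$, with $t_{\max} = \arg\max_t \sum_{e \in t} q(e)$

Max-Variational: $q(A \to B\,C, i, k, j) = \frac{r(A \to B\,C, i, k, j)}{\sum_x O(A_x, i, j)\,I(A_x, i, j)}$, with $t_{\max} = \arg\max_t \prod_{e \in t} q(e)$

Figure 3: Inference: Different objectives for parsing with posteriors. $A$, $B$, $C$ are base symbols, $A_x$, $B_y$, $C_z$ are
indexed symbols and $i$, $j$, $k$ are between-word indices. Hence, $(A_x, i, j)$ represents a constituent labeled with $A_x$
spanning words $i$ to $j$. $I(A_x, i, j)$ and $O(A_x, i, j)$ denote the inside and outside scores of this constituent, respectively.
For brevity, we write $c \equiv (A, i, j)$ and $e \equiv (A \to B\,C, i, k, j)$. Also, $t_{\max}$ is the highest scoring parse. Adapted from
Petrov and Klein [2007].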

Another parameter we introduce is a ‘switching-penalty’ $c_{sp}$ for the END rules (Figure 2). The
DOP1 model uses binary values (0 if symbol is intermediate, 1 otherwise) as the END rule weight,
which is equivalent to prohibiting fragment switching at intermediate symbols. We learn a
fractional constant $a_{sp}$ that allows (but penalizes) switching between fragments at annotated symbols
through the formulation $c_{sp}(X_{\mathrm{intermediate}}) = 1 - a_{sp}$ and $c_{sp}(X_{\mathrm{non\text{-}intermediate}}) = 1 + a_{sp}$. This
feature allows fragments to be assigned weights based on the binarization status of their nodes.

With the above weights, the recursive formula for $s(X_i)$, the total weighted number of fragments
rooted at indexed symbol $X_i$, is different from DOP1 (Equation 1). For rule $X_i \to Y_j\,Z_k$, it is

$$s(X_i) = \omega_{\mathrm{BODY}} \cdot (c_{sp}(Y_j) + s(Y_j))\,(c_{sp}(Z_k) + s(Z_k)).$$

The formula uses $\omega_{\mathrm{LEX}}$ in place of $\omega_{\mathrm{BODY}}$ if $r$ is a lexical rule (Figure 2).
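A small sketch of this weighted recursion is given below. It is illustrative only: the tree encoding and the function name `weighted_count` are assumptions, the switching penalty is passed in as a function of the child label so that the sketch does not commit to particular penalty values, and a preterminal node is assumed to contribute just its lexical weight.

```python
def weighted_count(node, w_body, w_lex, c_sp):
    """Weighted fragment count s(X_i) for a node token.

    node   : (label, left, right) for binary nodes, (label, word) otherwise
    w_body : weight of a binary CONTINUE rule
    w_lex  : weight of a rule with a POS parent (assumed base case)
    c_sp   : function from a child label to its END-rule weight
    """
    if len(node) == 2:
        return w_lex
    left, right = node[1], node[2]
    s_left = weighted_count(left, w_body, w_lex, c_sp)
    s_right = weighted_count(right, w_body, w_lex, c_sp)
    return w_body * (c_sp(left[0]) + s_left) * (c_sp(right[0]) + s_right)

tree = ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats")))

# With all weights set to 1 the recursion reduces to Equation 1.
unweighted = weighted_count(tree, 1.0, 1.0, lambda label: 1.0)

# A fractional body weight discounts larger fragments.
discounted = weighted_count(tree, 0.5, 1.0, lambda label: 1.0)
```

Setting every weight to 1 recovers the plain DOP1 count, so the weighted scheme can be read as a strict generalization of the counting recursion.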

The resulting grammar is primarily parameterized by the training treebank $B$. However, each
setting of the hyperparameters ($\omega_{\mathrm{BODY}}$, $\omega_{\mathrm{LEX}}$, $a_{sp}$) defines a different conditional distribution on trees.
We choose amongst these distributions by directly optimizing parsing F1 on our development set.
Because this objective is not easily differentiated, we simply perform a grid search on the three
hyperparameters. The tuned values are $\omega_{\mathrm{BODY}} = 0.35$, $\omega_{\mathrm{LEX}} = 0.25$ and $a_{sp} = 0.018$. For
generalization to a larger parameter space, we would of course need to switch to a learning approach that
scales more gracefully in the number of tunable hyperparameters.^9
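The tuning loop amounts to an exhaustive search over a small grid, which might look like the sketch below. The function `grid_search`, the grid values, and the stand-in objective `toy_f1` are hypothetical; in a real run the objective would parse the development set with the given hyperparameters and return its F1.

```python
from itertools import product

def grid_search(evaluate_f1, body_grid, lex_grid, sp_grid):
    """Return the (w_body, w_lex, a_sp) triple with the highest score."""
    return max(product(body_grid, lex_grid, sp_grid),
               key=lambda params: evaluate_f1(*params))

# Stand-in objective: this toy function simply peaks at one grid point.
def toy_f1(w_body, w_lex, a_sp):
    return -((w_body - 0.3) ** 2 + (w_lex - 0.2) ** 2 + (a_sp - 0.02) ** 2)

best = grid_search(toy_f1, [0.1, 0.3, 0.5], [0.2, 0.4], [0.0, 0.02, 0.04])
```

The cost grows as the product of the grid sizes, which is why this approach is only practical for a handful of hyperparameters.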


4 Efficient Inference


The previously described implicit grammar defines a posterior distribution $P(d_I \mid s)$ over a
sentence $s$ via a large, indexed PCFG. This distribution has the property that, when marginalized, it is
equivalent to a posterior distribution $P(d \mid s)$ over derivations in the correspondingly weighted
all-fragments grammar $G$. However, even with an explicit representation of $G$, we would not be able
to tractably compute the parse that maximizes
$P(t \mid s) = \sum_{d \in t} P(d \mid s) = \sum_{d_I \in t} P(d_I \mid s)$ [Sima’an, 1996].
We therefore approximately maximize over trees by computing various existing
approximations to $P(t \mid s)$ (Figure 3). Goodman [1996b], Petrov and Klein [2007], and Matsuzaki et al.
[2005] describe the details of constituent, rule-sum and variational objectives respectively. Note
that all inference methods depend on the posterior $P(t \mid s)$ only through marginal expectations of
labeled constituent counts and anchored local binary tree counts, which are easily computed from
$P(d_I \mid s)$ and equivalent to those from $P(d \mid s)$. Therefore, no additional approximations are made
in $G_I$ over $G$.

^9 Note that there has been a long history of DOP estimators. The generative DOP1 model was shown to be
inconsistent by Johnson [2002]. Later, Zollmann and Sima’an [2005] presented a statistically consistent estimator, with the
basic insight of optimizing on a held-out set. Our estimator is not intended to be viewed as a generative model of trees
at all, but simply a loss-minimizing conditional distribution within our parametric family.

               dev (≤ 40)      test (≤ 40)     test (all)
Model          F1     EX       F1     EX       F1     EX
Constituent    88.4   33.7     88.5   33.0     87.6   30.8
Rule-Sum       88.2   34.6     88.3   33.8     87.4   31.6
Variational    87.7   34.4     87.7   33.9     86.9   31.6

Table 1: All-fragments WSJ results (accuracy F1 and exact match EX) for the constituent, rule-sum and variational
objectives, using parent annotation and one level of markovization.

As shown in Table 1, our model (an all-fragments grammar with the weighting scheme shown in
Figure 2) achieves an accuracy of 88.5% (using simple parent annotation) which is 4-5% (absolute)
better than the recent TSG work [Zuidema, 2007, Cohn et al., 2009, Post and Gildea, 2009] and also
approaches state-of-the-art refinement-based parsers (e.g., Charniak and Johnson [2005], Petrov
and Klein [2007]).^10
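To make the max-constituent objective of Figure 3 concrete, the following sketch finds the binary bracketing that maximizes the sum of labeled-span posteriors with a simple dynamic program. It is a simplified illustration rather than the parser's decoder: the posterior table `q` is assumed to be given, spans missing from it score 0 under a placeholder label "X", and no grammar constraints are enforced.

```python
from functools import lru_cache

def max_constituent_parse(n, q):
    """Best binary bracketing of words 0..n under summed span posteriors.

    q maps (label, i, j) to a posterior score for that labeled span.
    Returns (total score, tree), where leaves are (label, i, j) and
    internal nodes are (label, left_subtree, right_subtree).
    """
    best_label = {}
    for (label, i, j), score in q.items():
        if score > best_label.get((i, j), ("X", 0.0))[1]:
            best_label[(i, j)] = (label, score)

    @lru_cache(maxsize=None)
    def best(i, j):
        label, score = best_label.get((i, j), ("X", 0.0))
        if j - i == 1:
            return score, (label, i, j)
        # Choose the split point with the highest combined child score.
        total, k = max((best(i, k)[0] + best(k, j)[0], k)
                       for k in range(i + 1, j))
        return score + total, (label, best(i, k)[1], best(k, j)[1])

    return best(0, n)

q = {("S", 0, 3): 1.0, ("NP", 0, 1): 0.9, ("VP", 1, 3): 0.8,
     ("V", 1, 2): 0.7, ("NP", 2, 3): 0.6, ("NP", 0, 2): 0.1}
score, tree = max_constituent_parse(3, q)
```

The rule-sum and variational objectives follow the same chart recursion but score anchored rules instead of spans, with a product in place of the sum for the variational case.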


4.1 Coarse-to-Fine Inference

Coarse-to-fine inference is a well-established way to accelerate parsing. Charniak et al. [2006]
introduced multi-level coarse-to-fine parsing, which extends the basic pre-parsing idea by adding
more rounds of pruning. Their pruning grammars were coarse versions of the raw treebank
grammar. Petrov and Klein [2007] propose a multi-stage coarse-to-fine method in which they construct a
sequence of increasingly refined grammars, reparsing with each refinement. In particular, in their
approach, which we adopt here, coarse-to-fine pruning is used to quickly compute approximate
marginals, which are then used to prune subsequent search. The key challenge in coarse-to-fine
inference is the construction of coarse models which are much smaller than the target model, yet
whose posterior marginals are close enough to prune with safely.

Our grammar $G_I$ has a very large number of indexed symbols, so we use a coarse pass to prune
away their unindexed abstractions. The simple, intuitive, and effective choice for such a coarse
grammar $G_C$ is a minimal PCFG grammar composed of the base treebank symbols $X$ and the
minimal depth-1 binary rules $X \to Y\,Z$ (and with the same level of annotation as in the full
grammar). If a particular base symbol $X$ is pruned by the coarse pass for a particular span $(i, j)$
(i.e., the posterior marginal $P(X, i, j \mid s)$ is less than a certain threshold), then in the full grammar
$G_I$, we do not allow building any indexed symbol $X_l$ of type $X$ for that span. Hence, the projection
map for the coarse-to-fine model is $\pi_C : X_l$ (indexed symbol) $\to X$ (base symbol).

^10 All our experiments use the constituent objective except when we report results for max-rule-sum and
max-variational parsing (where we use the parameters tuned for max-constituent, therefore they unsurprisingly do not
perform as well as max-constituent). Evaluations use EVALB, see http://nlp.cs.nyu.edu/evalb/.

Figure 4: Effect of coarse-pass pruning on parsing accuracy (for WSJ dev-set, ≤ 40 words), plotted against the
coarse-pass log posterior threshold (PT). Pruning increases to the left as log posterior threshold (PT) increases.

We achieve a substantial improvement in speed and memory-usage from the coarse-pass pruning.
Speed increases by a factor of 40 and memory-usage decreases by a factor of 10 when we go
from no pruning to pruning with a −6.2 log posterior threshold.^11 Figure 4 depicts the variation in
parsing accuracies in response to the amount of pruning done by the coarse-pass. Higher posterior
pruning thresholds induce more aggressive pruning. Here, we observe an effect seen in previous
work (Charniak et al. [1998], Petrov and Klein [2007], Petrov et al. [2008]), that a certain amount
of pruning helps accuracy, perhaps by promoting agreement between the coarse and full grammars
(model intersection). However, these ‘fortuitous’ search errors give only a small improvement
and the peak accuracy is almost equal to the parsing accuracy without any pruning (as seen in
Figure 5). To generate the graph in Figure 5, we used training and dev sentences of length ≤ 20,
because of time and memory constraints of low-pruning experiments. However, we also ran one
full-sized no-pruning experiment with training on all sentences and testing on sentences of length
≤ 40. The no-pruning test-set accuracy is 88.1% F1 as compared to 88.5% F1 when pruned with
a −6.2 log posterior threshold (which is the result shown in Table 1). This outcome suggests that
the coarse-pass pruning is critical for tractability but not for performance.
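The pruning test applied between the coarse and fine passes can be sketched as follows. The helper names, the dictionary of coarse posteriors, and the use of the natural logarithm for the threshold are assumptions made for illustration; the coarse posteriors themselves would come from inside-outside on the coarse grammar.

```python
import math

def allowed_spans(coarse_posteriors, log_threshold=-6.2):
    """Keep (base symbol, i, j) items whose coarse log posterior passes."""
    return {
        item for item, p in coarse_posteriors.items()
        if p > 0.0 and math.log(p) >= log_threshold
    }

def may_build(indexed_symbol, i, j, allowed):
    """An indexed symbol is licensed only if its base symbol survived."""
    base, _index = indexed_symbol      # projection: indexed -> base symbol
    return (base, i, j) in allowed

# Hypothetical coarse posteriors for the span (0, 2).
posteriors = {("NP", 0, 2): 0.5, ("VP", 0, 2): 1e-4}
allowed = allowed_spans(posteriors)
```

Because the check is made per base symbol and span, one coarse decision rules out every indexed symbol of that type at once, which is where the speed and memory savings come from.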
Figure 5: Effect of coarse-pass pruning on parsing accuracy (WSJ, training ≤ 20 words, tested on dev-set ≤ 20 words).
This graph shows that the fortuitous improvement due to pruning is very small and that the peak accuracy is almost
equal to the accuracy without pruning (the dotted line).


4.2 Packed Graph Encoding

The implicit all-fragments approach (Section 2.2) avoids explicit extraction of all rule fragments.
However, the number of indexed symbols in our implicit grammar is still large, because every
node in each training tree (i.e., every symbol token) has a unique indexed symbol. We have around
1.9 million indexed symbol tokens in the word-level parsing model (this number increases further
to almost 12.3 million when we parse character strings in Section 5.1). This large symbol space
makes parsing slow and memory-intensive.

We reduce the number of symbols in our implicit grammar by applying a compact, packed
graph encoding to the treebank training trees. We collapse the duplicate subtrees (fragments that
bottom out in terminals) over all training trees. This keeps the grammar unchanged because in a
tree-substitution grammar, a node is defined (identified) by the subtree below it. We maintain a
hashmap on the subtrees which allows us to easily discover the duplicates and bin them together.
The collapsing converts all the training trees in the treebank to a graph with multiple parents
for some nodes as shown in Figure 6. This technique reduces the number of indexed symbols
significantly as shown in Table 2 (1.9 million goes down to 0.9 million, reduction by a factor of
2.1). This reduction increases parsing speed by a factor of 1.4 (and by a factor of 20 for
character-level parsing, see Section 5.1) and reduces memory usage to under 4GB.
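The subtree collapsing can be sketched with a dictionary keyed on a canonical form of each subtree. This is a toy illustration under assumptions: the nested-tuple tree encoding and the function name `pack_treebank` are invented for the example, and the per-node counts stand in for the stored multiplicities.

```python
def pack_treebank(treebank):
    """Collapse identical subtrees; return root ids, node ids and counts.

    Trees are (label, left, right) for binary nodes and (label, word)
    for preterminals (an assumed toy encoding).
    """
    node_id = {}     # canonical subtree key -> shared node id
    counts = {}      # node id -> number of subtree tokens collapsed into it

    def visit(node):
        if len(node) == 2:
            key = node                                  # (label, word)
        else:
            key = (node[0], visit(node[1]), visit(node[2]))
        if key not in node_id:
            node_id[key] = len(node_id)
            counts[node_id[key]] = 0
        counts[node_id[key]] += 1
        return node_id[key]

    roots = [visit(tree) for tree in treebank]
    return roots, node_id, counts

toy_treebank = [
    ("S", ("NP", "dogs"), ("VP", ("V", "chase"), ("NP", "cats"))),
    ("S", ("NP", "cats"), ("VP", ("V", "chase"), ("NP", "cats"))),
]
roots, node_id, counts = pack_treebank(toy_treebank)
```

In this toy treebank, ten node tokens collapse to six graph nodes, and the shared VP subtree acquires two parents, mirroring the multiple-parent structure described above.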

We store the duplicate-subtree counts for each indexed symbol of the collapsed graph (using a
hashmap). When calculating the number of fragments $s(X_i)$ parented by an indexed symbol $X_i$
(see Section 3.2), and when calculating the inside and outside scores during inference, we
account for the collapsed subtree tokens by expanding the counts and scores using the corresponding
multiplicities. Therefore, we achieve the compaction with negligible overhead in computation.

^11 We calculated these improvement factors using a smaller experiment with full training and sixty 30-word test
sentences.

Figure 6: Collapsing the duplicate training subtrees converts them to a graph and reduces the number of indexed
symbols significantly.

Parsing Model            No. of Indexed Symbols
Word-level Trees         1,900,056
Word-level Graph         903,056
Character-level Trees    12,280,848
Character-level Graph    1,109,399

Table 2: Number of indexed symbols for word-level and character-level parsing and their graph versions (for
all-fragments grammar with parent annotation and one level of markovization).


5 Improved Treebank Representations

5.1 Character-Level Parsing

The all-fragments approach to parsing has the added advantage that parsing below the word level
requires no special treatment, i.e., we do not need an explicit lexicon when sentences are considered
as strings of characters rather than words.
Figure 7: Character-level parsing: treating the sentence as a string of characters instead of words
(leaves of the example tree: <w> H e </w> <w> a t e </w>).


               dev (≤ 40)      test (≤ 40)     test (all)
Model          F1     EX       F1     EX       F1     EX
Constituent    88.2   33.6     88.0   31.9     87.1   29.8
Rule-Sum       88.0   33.9     87.8   33.1     87.0   30.9
Variational    87.6   34.4     87.2   32.3     86.4   30.2

Table 3: All-fragments WSJ results for the character-level parsing model, using parent annotation and one level of
markovization.


Unknown words (words in test sentences that were unseen in training) are a major issue for parsing
systems. Handling them typically requires training a complex lexicon with various unknown-word
classes or suffix tries, along with smoothing factors that must be accounted for and tuned. With our
implicit approach, we can avoid training a lexicon by building up the parse tree from characters
instead of words. As depicted in Figure 7, each word in the training trees is split into its
corresponding characters with start and stop boundary tags (and then binarized in a standard
right-branching style). A test sentence's words are split up similarly and the test parse is built from
training fragments using the same model and inference procedure as defined for word-level parsing
(see Sections 2, 3 and 4). The lexical items (letters, digits, etc.) are now all known, so unlike
word-level parsing, no sophisticated lexicon is needed.
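
The word-to-character transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the boundary tags `<w>` and `</w>`, the `@`-prefixed names for intermediate binarization nodes, and the function names are all hypothetical.

```python
def word_to_chars(word, start="<w>", stop="</w>"):
    """Split a word into its characters, wrapped in boundary tags."""
    return [start, *word, stop]

def right_branching(label, symbols):
    """Binarize a flat symbol sequence under `label` as a right-branching tree.

    Intermediate nodes are named `@label` (an assumed naming convention).
    """
    inner = label if label.startswith("@") else "@" + label
    if len(symbols) <= 2:
        return (label, *symbols)
    # Keep the first symbol as the left child; recurse on the rest.
    return (label, symbols[0], right_branching(inner, symbols[1:]))

def leaves(tree):
    """Left-to-right terminal yield of a tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]

# The preterminal PRP over "He" becomes a small character-level subtree.
he = right_branching("PRP", word_to_chars("He"))
```

Applying the same split to a test sentence yields only boundary tags and single characters as terminals, which is why no unknown-word handling is required at the lexical layer.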

We choose a slightly richer weighting scheme for this representation by extending the two-weight
schema for CONTINUE rules ($\omega_{\mathrm{lex}}$ and $\omega_{\mathrm{body}}$) to a three-weight one:
$\omega_{\mathrm{lex}}$, $\omega_{\mathrm{word}}$, and $\omega_{\mathrm{sent}}$ for CONTINUE rules in the
lexical layer, in the portion of the parse that builds words from characters, and in the portion of the
parse that builds the sentence from words, respectively. The tuned values are
$\omega_{\mathrm{sent}} = 0.35$, $\omega_{\mathrm{word}} = 0.15$, $\omega_{\mathrm{lex}} = 0.95$ and Ugp = 0.
The character-level model achieves a parsing accuracy of 88.0% (see Table 3), despite lacking an
explicit lexicon.^^

Character-level parsing expands the training trees (see Figure 7) and the already large indexed
symbol space size explodes (1.9 million increases to 12.3 million, see Table 2). Fortunately, this is
where the packed graph encoding (Section 4.2) is most effective because duplication of character
strings is high (e.g., suffixes). The packing shrinks the symbol space size from 12.3 million to 1.1
million, a reduction by a factor of 11. This reduction increases parsing speed by almost a factor of
20 and brings down memory usage to under 8GB.

^^Note that the word-level model yields a higher accuracy of 88.5%, but uses 50 complex unknown word categories
based on lexical, morphological and position features, as described in Petrov et al. [2006]. Cohn et al. [2009] also use
this lexicon based on unknown word-classes.


5.2  Basic  Refinement:  Parent  Annotation  and  Horizontal  Markovization 

In  a  pure  all-fragments  approach,  compositions  of  units  which  would  have  been  independent  in  a 
basic  PCFG  are  given  joint  scores,  allowing  the  representation  of  certain  non-local  phenomena, 
such  as  lexical  selection  or  agreement,  which  in  fully  local  models  require  rich  state-splitting  or 
lexicalization.  However,  at  substitution  sites,  the  coarseness  of  raw  unrefined  treebank  symbols 
still creates unrealistic factorization assumptions. A standard solution is symbol refinement;
Johnson [1998] presents the particularly simple case of parent annotation, in which each node is marked
with its parent in the underlying treebank. It is reasonable to hope that the gains from using large
fragments and the gains from symbol refinement will be complementary. Indeed, previous work
has shown or suggested this complementarity. Sima'an [2000] showed modest gains from enriching
structural relations with semi-lexical (pre-head) information. Charniak and Johnson [2005]
showed  accuracy  improvements  from  composed  local  tree  features  on  top  of  a  lexicalized  base 
parser.  Zuidema  [2007]  showed  a  slight  improvement  in  parsing  accuracy  when  enough  fragments 
were  added  to  learn  enrichments  beyond  manual  refinements.  Our  work  reinforces  this  intuition 
by  demonstrating  how  complementary  they  are  in  our  model  (~20%  error  reduction  on  adding 
refinement  to  an  all-fragments  grammar,  as  shown  in  the  last  two  rows  of  Table  4). 
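
As a quick check of the roughly 20% figure, the dev-set F1 values from the last two rows of Table 4 can be plugged in directly, taking error to be 100 minus F1 (an assumption about how the reduction is measured):

```latex
\[
\frac{(100 - 85.7) - (100 - 88.4)}{100 - 85.7}
  = \frac{14.3 - 11.6}{14.3}
  = \frac{2.7}{14.3}
  \approx 0.19
\]
```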

Table  4  shows  results  for  a  basic  PCFG,  and  its  augmentation  with  either  basic  refinement  (parent 
annotation  and  one  level  of  markovization),  with  all-fragments  rules  (as  in  previous  sections),  or 
both. The basic incorporation of large fragments alone does not yield particularly strong
performance, nor does basic symbol refinement. However, the two approaches are quite additive in our
model and combine to give nearly state-of-the-art parsing accuracies.
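
For readers unfamiliar with parent annotation, the following is a minimal sketch of the transformation on a tuple-based tree. The `^` separator, the tree format and the function name are assumptions for illustration; horizontal markovization is not shown.

```python
def parent_annotate(tree, parent=None):
    """Return a copy of `tree` with each nonterminal relabeled `X^Parent`.

    A tree is a string (word) or a tuple (label, child, ...). Preterminal
    tags are annotated as well; the root has no parent and keeps its label.
    """
    if isinstance(tree, str):
        return tree
    label, *children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    # Children see the original (unannotated) label as their parent.
    return (new_label, *(parent_annotate(c, label) for c in children))

tree = ("S", ("NP", ("PRP", "He")), ("VP", ("VBD", "ate")))
annotated = parent_annotate(tree)
```

After annotation, an NP under S and an NP under VP become distinct symbols (`NP^S` and `NP^VP`), which is what weakens the independence assumption at substitution sites.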


5.3  Additional  Deterministic  Refinement 

Basic symbol refinement (parent annotation), in combination with all-fragments, gives test-set
accuracies of 88.5% (< 40 words) and 87.6% (all), shown as the Basic Refinement model in Table 5.
Klein and Manning [2003] describe a broad set of simple, deterministic symbol refinements
beyond parent annotation. We included ten of their simplest annotation features, namely: UNARY-DT,
UNARY-RB, SPLIT-IN, SPLIT-AUX, SPLIT-CC, SPLIT-%, GAPPED-S, POSS-NP, BASE-NP and
DOMINATES-V. None of these annotation schemes use any head information. This additional
annotation (see Additional Refinement, Table 5) improves the test-set accuracies to 88.7% (< 40
words) and 88.1% (all), which is on par with a strong lexicalized parser [Collins, 1999], even though
our model does not use lexicalization or latent symbol-split induction.^^

^^Full char-level experiments (w/o packed graph encoding) could not be run even with 50GB of memory. We
calculate the improvement factors using a smaller experiment with 70% of the training data and fifty 20-word test
sentences.

^^We further found that by pre-transforming the WSJ treebank with richer annotation from previous work (such as
the splits learned via hard-EM and 4 split-merge rounds of the Berkeley parser [Petrov et al., 2006]), we can obtain
state-of-the-art accuracies of up to 90% F1 with no change to our simple parser.
Parsing Model                                    F1
No Refinement (P=0, H=0)*                        71.3
Basic Refinement (P=1, H=1)*                     80.0
All-Fragments + No Refinement (P=0, H=0)         85.7
All-Fragments + Basic Refinement (P=1, H=1)      88.4

Table 4: F1 for a basic PCFG, and incorporation of basic refinement, all-fragments and both, for WSJ dev-set (< 40
words). P = 1 means parent annotation of all non-terminals, including the preterminal tags. H = 1 means one level
of markovization. *Results from Klein and Manning [2003].


[Figure 8 plot: F1 (y-axis) against the percentage of WSJ sections 2-21 used for training (x-axis)]

Figure 8: Parsing accuracy F1 on the WSJ dev-set (< 40 words) increases with increasing percentage of training data.

6  Other  Results 

6.1  Parsing  Speed  and  Memory  Usage 

The word-level parsing model using the whole training set (39832 trees, all-fragments) takes
approximately 3 hours on the WSJ test set (2245 trees of < 40 words), which is equivalent to roughly
5 seconds of parsing time per sentence, and runs in under 4GB of memory. The character-level
version takes about twice the time and memory. This novel tractability of an all-fragments
grammar is achieved using both coarse-pass pruning and packed graph encoding. Micro-optimization
may further improve speed and memory usage.
                                 test (< 40)      test (all)
Parsing Model                    F1     EX        F1     EX

FRAGMENT-BASED PARSERS
Zuidema [2007]                   -      -         83.8*  26.9*
Cohn et al. [2009]               -      -         84.0   -
Post and Gildea [2009]           82.6   -         -      -

THIS WORK
All-Fragments
  + Basic Refinement             88.5   33.0      87.6   30.8
  + Additional Refinement        88.7   33.8      88.1   31.7

REFINEMENT-BASED PARSERS
Collins [1999]                   88.6   -         88.2   -
Petrov and Klein [2007]          90.6   39.1      90.1   37.1

Table 5: Our WSJ test set parsing accuracies, compared to recent fragment-based parsers and top refinement-based
parsers. Basic Refinement is our all-fragments grammar with parent annotation. Additional Refinement adds
deterministic refinement of Klein and Manning [2003] (Section 5.3). *Results on the dev-set (< 100).


6.2  Training  Size  Variation 

Figure  8  shows  how  WSJ  parsing  accuracy  increases  with  increasing  amount  of  training  data  (i.e., 
percentage  of  WSJ  sections  2-21).  Even  if  we  train  on  only  10%  of  the  WSJ  training  data  (3983 
sentences),  we  still  achieve  a  reasonable  parsing  accuracy  of  nearly  84%  (on  the  development  set, 
<  40  words),  which  is  comparable  to  the  full-system  results  obtained  by  Zuidema  [2007],  Cohn 
et  al.  [2009]  and  Post  and  Gildea  [2009]. 


6.3  Other  Language  Treebanks 

On the French and German treebanks (using the standard dataset splits mentioned in Petrov and
Klein [2008]), our simple all-fragments parser achieves accuracies in the range of top
refinement-based parsers, even though the model parameters were tuned out of domain on WSJ. For German,
our parser achieves an F1 of 79.8% compared to 81.5% by the state-of-the-art and substantially
more complex Petrov and Klein [2008] work. For French, our approach yields an F1 of 78.0% vs.
80.1% by Petrov and Klein [2008].^^


7  Conclusion 


Our approach of using all fragments, in combination with basic symbol refinement, and even without
an explicit lexicon, achieves results in the range of state-of-the-art parsers on full scale
treebanks, across multiple languages. The main take-away is that we can achieve such results in a
very knowledge-light way with (1) no latent-variable training, (2) no sampling, (3) no smoothing
beyond the existence of small fragments, and (4) no explicit unknown word model at all. While
these methods offer a simple new way to construct an accurate parser, we believe that this general
approach can also extend to other large-fragment tasks, such as machine translation.

^^All results on the test set (< 40 words).